Self-Improving Robust Preference Optimization

Choi, Eugene, Ahmadian, Arash, Geist, Matthieu, Pietquin, Olivier, Azar, Mohammad Gheshlaghi

arXiv.org Machine Learning

Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017) has rapidly become a standard method for aligning Large Language Models (LLMs). A key practical issue shared by all prominent existing RLHF methods, offline or online (Ouyang et al., 2022; Rafailov et al., 2023; Azar et al., 2023; Zhao et al., 2023b; Ahmadian et al., 2024), is that their optimal solution depends heavily on the training task through the distribution used to generate the preference data (the behavior policy) (Munos et al., 2023; Azar et al., 2023). This makes existing RLHF methods brittle on out-of-distribution (OOD) tasks (Li et al., 2024; Kirk et al., 2024), where the evaluation distribution differs significantly from that of the behavior policy. Moreover, whenever the base/SFT model differs significantly from the behavior policy, this dependency makes the preference dataset and reward model less useful (Gao et al., 2022), as RLHF may undo the SFT/pretraining. To address this challenge, we introduce an alternative approach for aligning LLMs from human preferences built on more principled and robust foundations. Our goal is a solution that is robust to changes in the preference dataset: changes in the distribution from which completions are sampled should not significantly affect the final outcome of learning. To achieve this, we exploit the concept of self-improving (Huang et al., 2022; Bai et al., 2022) language models. By a self-improving LLM, we mean a model capable of enhancing its own outputs recursively with each inference iteration.
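The self-improvement mechanism the abstract invokes, a model that recursively refines its own output across inference iterations, can be sketched in a few lines. The following is a minimal illustration under our own assumptions, not the paper's implementation: the Model callable interface, the refinement prompt template, and the fixed iteration count are all hypothetical.

    from typing import Callable

    # A "model" here is any callable mapping a prompt string to a completion
    # string -- a deliberately minimal stand-in for an LLM inference call.
    Model = Callable[[str], str]

    def self_improve(model: Model, prompt: str, num_iterations: int = 3) -> str:
        """Recursive inference-time self-improvement: feed the model's own
        previous completion back to it, i.e. y_{t+1} = model(prompt, y_t)."""
        completion = model(prompt)  # initial draft y_0
        for _ in range(num_iterations):
            # Hypothetical refinement template; not taken from the paper.
            refine_prompt = (
                f"{prompt}\n\nPrevious answer:\n{completion}\n\n"
                "Improve on the previous answer."
            )
            completion = model(refine_prompt)  # y_{t+1}
        return completion

    if __name__ == "__main__":
        # Toy model for demonstration: echoes how much context it received.
        def toy(p: str) -> str:
            return f"answer conditioned on {len(p)} characters of context"

        print(self_improve(toy, "Summarize RLHF in one sentence."))

The point of the sketch is only the control flow: each iteration conditions the model on its previous completion, which is what lets such a model keep enhancing a response without any change to the training distribution.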


We should be pleased that robots are taking over some of our old jobs

#artificialintelligence

Mark Carney knows how to illustrate economic trends through creative language, and when he talks, people tend to listen. "The massacre of the Dilberts" was how the governor of the Bank of England encapsulated the fear that middle-management jobs would be wiped out by automation; for people unfamiliar with the American cartoon strip, Dilbert is a white-collar office worker and the strip mocks the absurdities of office life. In his native Canada this week, Carney made a number of points in a speech on automation. Most obviously, many office jobs done by people would be done by computers, a process that was already well advanced. "When I look back 30 years ago, what I used to do in the City of London when I worked at an investment bank, probably about three-quarters of what I did is now done by machine," he said.


Remote-controlled 'flying squad' to chase criminals

Daily Mail - Science & tech

The first 24-hour police drone unit is to be launched, amid fears that forces may have to rely on them because of falling officer numbers. The 'flying squad' will pursue suspects, find missing people and help solve murders. Assistant Chief Constable Steve Barry, national spokesman on drones, predicted that forces across Britain would soon be using them, as they are cheaper than helicopters and can perform some duties of bobbies on the beat. But the move has prompted privacy concerns and warnings that the technology should 'never be an excuse to cut officers'. Devon and Cornwall Police has advertised for a drone manager to lead its new dedicated unit, which will be launched in the summer and shared with Dorset.